Unlocking App Success: A Data-Driven Exploration with Python
When I came across this dataset from Kaggle, it felt like finding a hidden gem. As a data analyst focused on providing insights and solutions to marketing issues, I saw an opportunity to leverage this data and offer strategic recommendations on the kinds of apps that are most likely to succeed.

Table of Contents

Introduction

About the Dataset

Dataset Description

Objectives of the Notebook

Step 1: Data Preparation

1.1 Exploring the Dataset

1.2 Format and Types of Variables

Several issues arise when attempting to convert the variables to their appropriate types.

1.2.1 Correcting Values in a Row

All values in this row have shifted to the left and need to be moved back into place.

Moving them back leaves a gap in Category. By looking up more information on this app, I found that it is a photo frame app, so I reassigned its category to "LIFESTYLE" and its genre to "House & Home".
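As a minimal sketch of this kind of repair, the toy example below shifts one row's values one column to the right and then fills the gap by hand. The DataFrame contents, the row label `bad`, and the one-column offset are illustrative assumptions, not the actual values from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): the second row has lost its Category,
# so every later value sits one column too far to the left.
df = pd.DataFrame(
    {
        "App": ["Good App", "Shifted App"],
        "Category": ["TOOLS", "1.9"],
        "Rating": ["4.1", "19"],
        "Reviews": ["159", "3.0M"],
        "Size": ["19M", np.nan],
    }
)

bad = 1  # index label of the shifted row (hypothetical)
cols = df.columns[1:]  # every column after "App"

# Move the row's values one column to the right; Category becomes empty.
df.loc[bad, cols] = df.loc[bad, cols].shift(1).to_numpy()

# Fill the gap manually, based on what is known about the app.
df.loc[bad, "Category"] = "LIFESTYLE"

print(df)
```

The same manual assignment would apply to the genre column once the row is back in place.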

1.2.2 Genres: Separating Values

Several genres are currently packed into a single variable. I will separate them into different columns and convert them to the category type.
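A possible way to do this split is sketched below. It assumes that multiple genres are separated by a semicolon and that an app has at most two genres; the column names `Genre_1` and `Genre_2` are made up for the illustration.

```python
import pandas as pd

# Toy data (hypothetical): one or two genres per app, separated by ";".
df = pd.DataFrame(
    {"Genres": ["Art & Design", "Art & Design;Creativity", "Education;Brain Games"]}
)

# Split on the first ";" into two columns; apps with one genre get a missing second value.
parts = df["Genres"].str.split(";", n=1, expand=True)

df["Genre_1"] = parts[0].astype("category")
df["Genre_2"] = parts[1].astype("category")

print(df.dtypes)
```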

1.2.3 Size: Converting KB to MB

Values in Size are listed in either kilobytes (KB) or megabytes (MB). To make later analysis easier, I will convert the KB values to MB. Since 1 MB = 1024 KB, I will divide the KB values by 1024.

All values are now numeric and expressed in megabytes: the MB values were simply reformatted, while the KB values were reformatted and converted.
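The conversion can be sketched as a small helper function. It assumes the raw values are strings ending in "M" for megabytes or "k" for kilobytes; the function name `size_to_mb` is hypothetical, and anything it cannot parse is returned as NaN.

```python
import math


def size_to_mb(raw):
    """Convert a string such as '19M' or '512k' to megabytes (NaN if unparsable)."""
    text = str(raw).strip()
    if not text:
        return math.nan
    suffix = text[-1].lower()
    try:
        number = float(text[:-1].replace(",", ""))
    except ValueError:
        return math.nan
    if suffix == "m":
        return number          # already in MB
    if suffix == "k":
        return number / 1024   # 1 MB = 1024 KB
    return math.nan


print(size_to_mb("19M"), size_to_mb("512k"))
```

With pandas, such a helper could be applied to the whole column with `df["Size"].apply(size_to_mb)`.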

1.2.4 Installs: Formatting and Converting to Integer

A range is given for the number of installs:
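Assuming the raw values are strings such as "10,000+" (a lower bound followed by a plus sign), the formatting step could look like the sketch below. The exact string format and the helper name `installs_to_int` are assumptions.

```python
def installs_to_int(raw):
    """Turn a string such as '10,000+' into the integer 10000 (the lower bound)."""
    cleaned = str(raw).replace(",", "").replace("+", "").strip()
    # int() raises ValueError on non-numeric text, which helps surface bad rows early.
    return int(cleaned)


print(installs_to_int("10,000+"))
```

Keeping the lower bound means the resulting integer understates the true number of installs, which is worth remembering when comparing apps.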

1.2.5 Type: Verification of the Classification

1.2.6 Content Rating

As most apps are rated for “Everyone”, this variable may not be very useful for analysis. This suggests that apps are generally not targeted at anyone under the age of 10.

1.2.7 App: Handling Duplicates

Of the 10,841 values in the App column, 1,181 were duplicates.

Not every duplicated App had different values in other variables. However, when this was the case, it was predominantly in Reviews.

I applied a function that:

The new dataframe, without duplicated Apps, contains 9,660 entries. This confirms the success of the function, as 10,841 minus 1,181 equals 9,660.
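One way to deduplicate and run the same sanity check is sketched below. The rule used here, keeping the row with the highest number of reviews for each app, is an assumption made for illustration and is not necessarily the rule the original function applied.

```python
import pandas as pd

# Toy data (hypothetical): apps "A" and "C" each appear twice.
df = pd.DataFrame(
    {
        "App": ["A", "A", "B", "C", "C"],
        "Reviews": [100, 250, 40, 7, 7],
    }
)

# Number of rows that repeat an App name already seen.
n_duplicates = int(df["App"].duplicated().sum())

# Assumed rule: for each app, keep the row with the most reviews.
deduped = (
    df.sort_values("Reviews", ascending=False, kind="stable")
    .drop_duplicates(subset="App", keep="first")
    .sort_index()
)

# Same check as in the text: original rows minus duplicates equals remaining rows.
assert len(deduped) == len(df) - n_duplicates
print(deduped)
```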

1.2.8 Category and Genre: Overlap and Differences

While some categories may also appear in genres, genres will offer a higher level of detail.

1.3 Checking for Outliers

1.3.1 Visualization of Outliers

1.3.2 Outliers in Reviews

What may initially appear as outliers in the Reviews are actually the most popular applications from Meta (formerly Facebook). These apps have a significantly higher number of reviews due to their widespread usage and popularity. It's a common occurrence in data analysis where popular items can seem like outliers. It's always important to understand the context of the data.

This data suggests that users have the option to rate an app without necessarily leaving a review. Consequently, we can only quantify the number of reviews, not the total number of ratings.

It’s plausible that if the app was recently launched, early adopters might have rated it without leaving a review. This is a common behavior among users who are still exploring the features of a new app.

However, it would indeed be unusual to observe a high number of installs with no reviews. This could potentially indicate issues with the review system or user engagement strategies of the app. It’s always crucial to consider these factors when analyzing app performance data.

This analysis suggests two possibilities:

Given that we can’t delve deeper into this (since it’s impossible to determine if they were just released without a release date), and considering that their number is low, it seems reasonable not to remove these entries. This approach allows us to preserve as much data as possible for our analysis.

1.3.3 Outliers in Installs

The number of installs having been converted from a range to a number, I will display all apps with the highest number of installs.

These are again apps from GAFAM companies: Google (Alphabet), Facebook (Meta) and Microsoft. The only other one, Subway Surfers, was the most downloaded game of the last decade (source: App Annie).


1.3.4 Outliers in Size

Around 2018, 100MB was indeed the maximum storage allowed for APKs (source: Android Developers Blog).

1.3.5 Outliers in Price


What started as a joke in 2008 (a useless app at an extremely high price, quickly removed from the App Store) has been replicated in different versions, which were available on Google Play at the time the data was scraped.

1.4 Handling Missing Values

1.4.1 Converting "Varies with Device" Values to NaN

Values that read as “varies with device” should be considered as missing values. It’s interesting to note that we only find them in variables related to technical aspects. Regardless, they will be converted to NaN.
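A minimal sketch of this conversion follows. The exact spelling of the placeholder string and the column names in the toy DataFrame are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns and values).
df = pd.DataFrame(
    {
        "Size": ["19M", "Varies with device", "8.7M"],
        "Android Ver": ["4.0.3 and up", "Varies with device", "Varies with device"],
    }
)

# Replace the placeholder string with a real missing value in every column.
df = df.replace("Varies with device", np.nan)

missing_per_column = df.isna().sum()
print(missing_per_column)
```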

1.4.2 Identifying NaN

The prospect of losing almost 30% of the initial dataset can be daunting. However, only 15% of these are in the Rating category, which is the variable we will predict.
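The two figures discussed here, the share of rows that would be lost and the share of missing values per column, can be computed as in the sketch below. The data is a toy example, so its percentages do not match the ones quoted above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values).
df = pd.DataFrame(
    {
        "Rating": [4.1, np.nan, 3.9, 4.5],
        "Size": [19.0, 8.7, np.nan, 2.0],
    }
)

# Percentage of missing values in each column.
pct_missing_by_column = df.isna().mean() * 100

# Percentage of rows that contain at least one missing value,
# i.e. what would be lost by dropping every incomplete row.
pct_rows_with_any_nan = df.isna().any(axis=1).mean() * 100

print(pct_missing_by_column)
print(pct_rows_with_any_nan)
```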

1.4.3 Imputation of Missing Values

The percentage of missing values in categories being low, I decide to use the simplest method of imputation by mean or median:
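Mean and median imputation can be sketched as follows. Which column receives the mean and which the median is an assumption made for this toy example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values).
df = pd.DataFrame(
    {
        "Rating": [4.0, np.nan, 5.0, 3.0],
        "Size": [10.0, 20.0, np.nan, 60.0],
    }
)

# Median imputation: robust to extreme values.
df["Rating"] = df["Rating"].fillna(df["Rating"].median())

# Mean imputation: simple, but sensitive to outliers.
df["Size"] = df["Size"].fillna(df["Size"].mean())

print(df)
```

The median is usually the safer choice for skewed variables, since a few very large values can pull the mean away from typical values.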

Step 2: Data Exploration

2.1 Free Apps vs Paid Apps

2.2 Categories and Genres Analysis

2.2.1 Categories vs Rating

FAMILY, GAME and TOOLS categories account for almost 40% of all apps.

Rating seems fairly evenly distributed across genres.

Best performing app categories will have a high user satisfaction (high median) but also a more consistent user satisfaction (low IQR).
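The criterion above (high median, low interquartile range) can be computed per category as in this sketch. The data is a toy example and the category labels "A" and "B" are placeholders.

```python
import pandas as pd

# Toy data (hypothetical): "A" is rated consistently, "B" varies widely.
df = pd.DataFrame(
    {
        "Category": ["A", "A", "A", "B", "B", "B"],
        "Rating": [4.4, 4.5, 4.6, 3.0, 4.0, 5.0],
    }
)

grouped = df.groupby("Category")["Rating"]

summary = pd.DataFrame(
    {
        "median": grouped.median(),
        # IQR = third quartile minus first quartile.
        "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
    }
)

# Highest median first; among equal medians, the smallest IQR first.
ranked = summary.sort_values(["median", "iqr"], ascending=[False, True])
print(ranked)
```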

In other words, you'll be more likely to have a high rating, and less likely to have a low rating, in these categories:

But you should avoid categories like:

2.2.2 Genres vs Rating

While most genres are typically represented only within their respective categories (for example, ‘Social’ in ‘SOCIAL’), the two largest categories, ‘FAMILY’ and ‘GAME’, are more widely distributed. They are spread across 29 and 18 genres, respectively.

Genres that are more likely to satisfy customers are:

While the least likely are:

2.2.3 Categories vs Price

2.2.4 Genres vs Price

2.2.5 Genres vs Size

2.3 Rating vs Other Variables

2.3.1 Correlation Matrix

2.3.2 Rating vs Reviews, Size and Price

While a large size could deter users from installing an app (storage is limited on mobile devices), most of the largest apps seem to be highly rated.

With an R-squared of 0.003, Reviews, Price and Installs explain almost none of the variation in Rating, so building a useful model to predict Rating from them isn't feasible.
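To illustrate how an R-squared this low arises, the sketch below fits an ordinary least squares regression with NumPy on synthetic data in which the target is generated independently of the predictors. The data is random and only stands in for Reviews, Price and Installs; it is not the dataset analyzed here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for three predictors (assumption: unrelated to the target).
X = rng.normal(size=(n, 3))
# Synthetic "rating", generated without any dependence on X.
y = 4.2 + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# R-squared = 1 - (residual sum of squares / total sum of squares).
residuals = y - A @ coef
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 4))
```

An R-squared close to zero means the fitted model does barely better than always predicting the mean rating.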

Conclusion

Overall, the rating does not seem to be influenced by other variables. However: